Multi‑label classification of biomedical data

Biomedical datasets constitute a rich source of information, containing multivariate data collected during medical practice. In spite of inherent challenges, such as missing or imbalanced data, these types of datasets are increasingly utilized as a basis for the construction of predictive machine-learning models. The prediction of disease outcomes and complications could inform the process of decision-making in the hospital setting and ensure the best possible patient management according to the patient's features. Multi-label classification algorithms, which are trained to assign a set of labels to input samples, can efficiently tackle outcome prediction tasks. Myocardial infarction (MI) represents a widespread health risk, accounting for a significant portion of heart disease-related mortality. Moreover, the danger of potential complications occurring in patients with MI during their period of hospitalization underlines the need for systems to efficiently assess the risks of patients with MI. In order to demonstrate the critical role of applying machine-learning methods in medical challenges, in the present study, a set of multi-label classifiers was evaluated on a public dataset of MI-related complications to predict the outcomes of hospitalized patients with MI, based on a set of input patient features. Such methods can be scaled through the use of larger datasets of patient records, along with fine-tuning for specific patient sub-groups or patient populations in specific regions, to increase the performance of these approaches. Overall, a prediction system based on classifiers trained on patient records may assist healthcare professionals in providing personalized care and efficient monitoring of high-risk patient subgroups.


Introduction
Machine learning is a subset of artificial intelligence, aimed at developing 'intelligent' algorithms that harness data to execute tasks with optimal performance (1).Machine learning algorithms can be broadly split into four categories: Supervised, semi-supervised, unsupervised and reinforcement-learning (2).In supervised learning, the algorithm is given a dataset known as 'training data', where each training sample corresponds to one or several inputs and the desired output (3).through an iterative process, the algorithm determines a function which can correctly predict the desired output from a set of new, previously unknown inputs.Supervised machine-learning tasks include classification, regression and forecasting (4).Unsupervised learning, conversely, is carried out on unlabeled datasets with the aim of extracting patterns and information without external supervision (5).Semi-supervised learning, as indicated by the name, falls somewhere between supervised and unsupervised techniques, as only a portion of the training data is labeled (6).lastly, during reinforcement learning tasks, an intelligent agent takes actions in a set environment (7). the actions return a reward, while also influencing the environment and the state of the agent.the goal of the agent is to 'learn' the policy which maximizes the reward function, or more generally, maximizes the reinforcement signal that is generated by the rewards (8).
Machine learning approaches, as a whole, have become increasingly relevant in the current era of 'big data'.Data technologies are rapidly evolving, with data storage sizes entering petabytes, cloud services enabling high data transfer speed and computational systems shifting towards high performance cluster computing (9)(10)(11).This has allowed the implementation of machine learning in a range of diverse fields, including healthcare.a trove of biomedical data is generated daily throughout the process of medical practice and patient care.Examples of such data include imaging results (ultrasounds, magnetic resonance imaging, computed tomography scans), laboratory test results (cell cultures, biological material analyses and sequencing), patient medical history, drug effects and interactions and patient health outcomes (12).These can be regarded as attractive targets for the implementation of machine learning algorithms, wherein the desired objective is tailored to the respective challenge.
the prediction of disease complications constitutes a key aspect of patient treatment and management (13).The study published in 2022 by Ghosheh et al (14) investigated the use of a predictive system to predict the risk of developing complications in patients diagnosed with coronavirus disease 2019 (COVID- 19), trained on data of >3,000 patients in the United arab Emirates.the prediction of disease complications, overall, can be interpreted as a predictive classification problem, thus opening the door to the application of supervised machine learning algorithms.the current status of the patient can be analyzed by diagnostic classification models, while potential outcomes can be predicted by prognostic classification models (15).Predictive classification algorithms have been implemented in the context of various diseases within the past years.During the peak of the recent severe acute respiratory syndrome coronavirus 2 (SarS-CoV-2) pandemic, classification models were designed to tackle various aspects of the disease, such as the detection of viral infection through X-ray imaging and Ct scans or the prediction of outcomes of patients with COVID-19 using their recorded characteristics as input (16)(17)(18).This is even more important under the scope of personalized medicine, as the individual patient profile, which includes features, such as comorbidities, age, ethnic background etc., may affect disease progression and clinical manifestations (19).In a number of cases, several conditions, or labels, may be assigned to a single patient; for example, an individual may be suffering from COVID-19, while also exhibiting cardiovascular issues and high cholesterol.In such a case, the challenge of building predictive models can be regarded as a multi-label classification problem.
This type of classification task falls under the supervised machine learning 'umbrella' and constitutes a modeling problem where the class, also known as target or label, is predicted for a data point, otherwise known as input sample (20).In single-label classification, from a collection of discrete L (L>1) labels, a single label 'l' is assigned to each input sample (21).If the number of labels L is 2, the task is considered a binary classification problem, whereas if the number of labels L>2, it is considered a multiclass classification problem (22).In the case where each input sample is associated with a set of non-exclusive class labels, the task is called multi-label classification (23).
One of the common challenges in classification and even more so in the case of multi-label, biomedical datasets, is the presence of imbalanced data (24).In the multi-label setting, imbalance can be traced to three distinct levels, imbalance within labels, imbalance between labels and imbalance within label sets (25).In inter-label imbalance, the label column contains a disproportionate ratio of negative vs. positive samples, effectively obscuring the signal of the particular label (26).In intra-label imbalance, there is difference in frequency between labels, with frequency most often termed as the number of positive instances (27).Labels with an abundance of positive samples are considered in majority, while the rest are minority labels (28).In multi-label classification, there is a possibility for a sample to simultaneously contain a label in minority and another label that is in majority.Labels in majority are termed head labels and labels in minority tail label (29).the last source of multi-label imbalance is the existence of label sets, where some sets may be more frequent in the dataset than others (30)(31)(32).
Myocardial infarction (MI), or more generally known as a 'heart attack', is caused by a decrease or total interruption in blood flow to part of the myocardium (33).As per the World Health Organization, more than four out of five cardiovascular disease-related deaths are due to heart attacks and strokes, while a third of these deaths concern individuals aged <70 years (34).thrombosis has been established as the most prominent driver of acute MI, itself stemming from events of atherosclerosis and inflammation (35).Atherosclerotic lesions emerge as thickening of the coronary inner walls of the artery or as fatty streaks, ultimately leading to thrombus formation (36).During their hospitalization, patients with acute MI often face severe complications, such as shock, stroke, atrioventricular block, respiratory failure and cardiopulmonary arrest (37)(38)(39).Shock and posterior cerebral artery, in particular, constitute prevalent causes of mortality in patients with acute MI (40).Establishing a robust plan of action is inextricably linked to the successful management of MI-related complications in the hospital setting.Therefore, the application of a classification model to predict potential MI complications may prove useful in informing the decision-making of medical professionals.
In the present study, to demonstrate the potential of leveraging biomedical patient data in the context of predictive classification, a multi-label classification model was trained on a recently released, public dataset of MI patient data.

Data and methods
Data collection and preprocessing.Machine-learning tasks require, first and foremost, a solid data-related foundation, for the implementation and validation of the framework and its constituent algorithms.To explore the multi-label classification task, a public dataset regarding the outcomes of patients with MI was retrieved (41).The original dataset consists of 114 descriptive metrics collected for a total of 1,700 patients with MI, along with information regarding patient outcomes and complications, split into 12 candidate categories.Descriptive analytics of the dataset and information regarding the convention of column names can be found in table SI. the recorded patient metrics are of mixed nature, falling into one of the following types: Binary, real and ordinal.The binary data stems from two attribute-handling approaches during the dataset's original creation.First, categorical variables were dummy coded into 0 and 1; for example, the column 'SEX' where values 'female' and 'male' were arbitrarily encoded to 0 and 1, respectively.Secondly, binary encoding was used to indicate presence or absence of attributes, such as in the 'symptomatic hypertension' (SIM_GIPERT) column, where presence or absence was encoded to 1 and 0, respectively.The real, or numerical data, within the dataset mainly concern patient measurements taken during assessment by the medical professionals and throughout hospitalization, such as 'systolic blood pressure according to Emergency Cardiology team' (S_aD_KBrIG).lastly, the ordinal feature data contained values with intrinsic order; for example, the column 'presence of essential hypertension' (GB), which took values 0, 1, 2 or 3, corresponding to absence, stage 1, stage 2 and stage 3 essential hypertension.An overview of the steps of the pipeline is presented in Fig. 1.
as per the dataset's curators, there are four candidate time settings for the construction of the prediction challenge: the time of admission to hospital, 24, 48 and 72 h following hospital admission.the selection of time setting determines the columns which can be used as input data during model fitting.The end of the first day (24 h post-hospital admission), was selected as the time setting for the present implementation of the classification algorithm, leading to the exclusion of six input columns, plus the patient ID column which serves no predictive purpose.
Missing data constitutes a common issue in biomedical datasets, as the process of data collection is error-prone, particularly in a hospital setting where such processes are rarely automated and are most often handled by the doctors and nurses themselves.Discarding every single sample/patient record that is missing a portion of the input features could harm the potential of the classifier, as it would markedly lessen the amount of information available to train the classifier on.a common strategy to remedy this problem is data imputation, where the missing data are imputed by various methods.In the present pipeline, input columns which contained missing values above a threshold of 85% of the total number of rows were removed, and a multivariate imputer was used to estimate the missing values in the rest of the dataset.lastly, a baseline for the type of target data was set.As aforementioned, the output (target) data span columns 113-124 of the original dataset.all columns but one contained data in binary form, denoting presence or absence of a complication/outcome.To maintain the binary type uniform across the output data, the singular column 'lethal outcome' (LET_IS), which contained non-ordinal, numerical data, was one-hot encoded.To one-hot encode a variable with n possible, non-ordinal values, n separate columns are generated and the presence or absence of the value in a sample is denoted by 1 and 0, respectively.In the case at hand, numbers 0-7 had been assigned to the outcomes 'alive', 'cardiogenic shock', 'pulmonary edema', 'myocardial rupture', 'progress of congestive heart failure', 'thromboembolism', 'asystole', 'ventricular fibrillation', and were subsequently split into eight separate binary data columns.The final dataset can be found in table SII.additionally, a detailed description of the original MI database, descriptive statistics and a list of the column abbreviations can be found via the following weblink: 10.25392/leicester.data.12045261.v3.
Label relations exploration.In classification tasks, it is useful to explore the label space and the associations between the labels reported within the dataset.Graphs are increasingly used in the study of complex systems, such as protein or traffic networks, enabling the implementation of embedding algorithms (42).In the multi-label setting, graphs represent multiple levels of information, as graph edges could reflect a range of relationships between labels, from simpler to more complex ones (43).Furthermore, the study of clustering and interactions between labels could elucidate more obscure factors underlying the network's structure (44).For example, a network of comorbidities represented as a multi-label graph could provide insight into subtle interplays between pathological conditions.to explore the graph space, the label Cooccurrence Graph Builder class was imported from the skmultilearn.clustermodule as the graph builder base (45).The NetworkXLabelGraphClusterer class from the same module was used to study the community and clustering trends across the label instances by use of the louvain method (46), and NetworkX was used to visualize the graph (47).
Data-driven model selection and hyperparameter tuning.the selection of a classification algorithm is entirely dependent on the nature of the multioutput problem concerned, and there is no established guideline for choosing a model.Two factors to be considered during the selection process are performance and efficiency.Efficiency is inherently tied to model aspects, such as its scalability, the type of label combinations within the dataset, and so on.A classification algorithm that exhibits high performance may suffer from low efficiency; for example, choosing a model that trains a single classifier per label would be unsuitable or too slow for a task with a large label space.Performance may be viewed as the model's generalization capability and there exist several evaluation metrics that can be employed.Precision measures the model's reliability in classifying a sample as positive, accuracy measures how well the model performs across all the classes of the dataset (48) and recall represents the ratio of how many of the actual labels were correctly predicted (49).While the aforementioned metrics hold up well in cases of multi-class classification, in multi-label classification, where the model's predictions can range from fully or partially correct to fully incorrect, the adjustment of evaluation measures is required to reflect these subtleties (50).Hamming loss (HL) measures the hamming distance between the true and the predicted label, penalizing the incorrectly predicted labels in a predicted label set individually; therefore, the metric is capable of reflecting the notion of partially correct model predictions (23).the metric ranges from 0 to 1 and the lower the HL, the better performance is exhibited by the multi-label classification model.
to carry out data-driven model selection, a set of algorithms were trained on the dataset and their performance was evaluated by the HL metric.Parameters which control the model's architecture are termed 'hyperparameters' and the process of exploring possible choices towards an optimal model architecture is termed 'hyperparameter tuning' (51).To carry out the step, aa cross-validated grid search was employed.The method, which is available through the GridSearchCV class of the scikit-learn module, entails an exhaustive search over a parameter grid, in order to yield optimal model parameters (52).Inputting an integer for the 'cv' parameter of the class enables a stratified k-fold split, where the dataset is divided into k partitions and for each split, a search through a user-set range of the hyperparameter spaces is executed, fitting and scoring each combination to elect the hyperparameters which lead to the best model performance (53).The pool of candidate classification algorithms contained the following: Classifier Chains (CC) with Random Forest as the base classifier, Classifier Chains with XGBoost as the base classifier, Binary Relevance k-Nearest Neighbors (BRkNN) Classifier, random Forest (rF) Classifier, Multi-label k-Nearest Neighbors Classifier (MLkNN) and OneVsRest with XGboost, all of which are available through the scikit-learn and XGBoost libraries (54,55).
Extreme Gradient Boosting (XGBoost) is an advanced implementation of the Gradient Boosting decision tree algorithm (55).Gradient Boosting builds the first learner, a decision tree, on the training dataset to perform the prediction of the target samples, then calculates the loss, which is the difference between the true value (or true label) and the predicted value that has been generated from the first learner (56).The residual of the loss function is calculated using the Gradient Descent Method and is used as the target variable for the next iteration, where an improved learner is built (57).In brief, numerous models are trained sequentially, and the algorithm aims to boost them into a strong learner which best predicts the target.XGBoost implements parallel processing, increasing the algorithm's speed ten-fold, compared to standard Gradient Boosting (55).Furthermore, the algorithm is flexible, allowing the user to select custom optimization objectives and fine-tune booster and task parameters.XGBoost does not naively support multi-output classification, hence, to implement extreme gradient boosting in the multi-label classification problem, the XGBoost model was wrapped inside the MultiOutputClassifier class from the scikit-learn module (54).
Training and K-fold cross validation.In traditional machine-learning model development, the model is trained on a partition of the data, called the training set, and a set of the data unseen to the model during training is used as the test set, to evaluate the performance of the algorithm based on new data (58,59).K-fold cross validation entails dividing the dataset into k non-overlapping groups of rows, then training the machine-learning model on all the groups save for a hold-out fold, which is then used as the test set (60).The process is repeated across all folds, until each fold has been used as the hold-out test set, and model performance is averaged across the folds (61).To carry out k-fold cross validation and account for the imbalanced multi-label dataset, high-order iterative stratification was implemented via the scikit-learn 'iterative_stratification' module (62,63).In brief, dataset splits are created while maintaining balanced representation of labels within each fold as much as possible.To evaluate model performance on the imbalanced dataset, HL was selected as the evaluation metric.Building, training and testing the classifiers was carried out in a Jupyter environment, using a 4-core CPU and 2-core GPU-accelerated system.

Results
at the end of the data preprocessing steps, the dataset contained 1,700 rows corresponding to 1,700 patient samples, 100 columns of patient measurements/features serving as the input (X) data and 18 columns of patient outcomes, serving as the target label (y) data.as visible in the histogram plot illustrated in Fig. 2a, there is notable imbalance across the instance of labels.the number of labels per sample is also varied, with most samples assigned one or two labels and very few samples exhibiting more than three labels (Fig. 2B).this can be traced back to the challenge of data collection that was touched upon in the introduction segment; the limited number of patients affects the number of labels (outcomes) that happened to be present among them and were thus recorded.
The output label portion of the dataset was used as input to generate information regarding label interactions and relationships.In the case at hand, the potential outcomes and complications of hospitalized patients with MI are regarded as labels.the results of the exploration of the label space can be summarized in the circular graph of Fig. 3.Each label, i.e., each complication, is represented as a node in the graph, with an edge existing when there is co-occurrence between the labels, weighted by the frequency of co-occurrence.By using the louvain algorithm, a popular community detection method, three clusters are reported, denoted by purple, yellow and light blue colour.
the atrial fibrillation (FIBR_ PREDS) complication (non-lethal) exhibits strong relations with progress of congestive heart failure and asystole, both tagged within the dataset as lethal complications.Notable relations are also reported between atrial fibrillation-relapse of the myocardial infarction, and atrial fibrillation-pulmonary edema (OTEK_ LANC).It is also noteworthy that, as regards lethal outcomes, cardiogenic shock, asystole and ventricular fibrillation have been clustered together, as have pulmonary edema, progress of congestive heart failure and thromboembolism.the lethal complication of myocardial rupture, on the other hand, has been clustered with-and exhibits relations with-the myocardial rupture (RAZRIV) complication label, potentially indicating that patients with myocardial rupture were assigned to both labels up to the lethal outcome.
The candidate algorithms were evaluated according to their performance on the dataset and the results are summarized in Fig. 4. The highest HL was exhibited by the Multi-label Note that the number of labels exhibited by a patient will normally be smaller than the total of possible outcomes, as one patient will not exhibit all possible outcomes.Figure 3. Label graph of the labels represented as nodes, with the thickness of edges corresponding to co-occurrence rate.By using the Louvain algorithm, three clusters are reported, denoted by purple, yellow and light blue colour.Graph nodes sharing a colour belong to the same cluster.The nodes with the same colour have been assigned to the same cluster.k-Nearest Neighbor (MLkNN) classifier, whereas the lowest HL (0.053) was exhibited by the OnevsRest classifier with XGBoost.This intuitive classifier strategy, also known as the one-vs.-all,trains a binary classifier independently for each label, and can be applied to both multi-class and multi-label problems.

Discussion
Predictive classification algorithms have been implemented in the context of various diseases over the past years.During the peak of the recent SARS-CoV-2 pandemic, classification models were designed to tackle various aspects of the disease, such as the detection of viral infection through X-ray imaging and CT scans or the outcome prediction of COVID-19 patients using their recorded characteristics as input (16)(17)(18).This is even more critical under the scope of personalized medicine, as the individual patient profile, which includes features, such as comorbidities, age, ethnic background etc., may affect disease progression and clinical manifestations (19).In a number of cases, several conditions, or labels, may be assigned to a single patient; for example, an individual may be suffering from COVID-19, while also exhibiting cardiovascular issues and high cholesterol levels.In such a case, the challenge of building predictive models can be regarded as a multi-label classification problem, as explored herein.
The development and testing of classifiers is a complex task, particularly in the case of multi-label classification.The comparative evaluation of various algorithms on the myocardial infarction dataset elected the oneVsrest heuristic as the best-performing one.Furthermore, the use of XGBoost as the base classifier enabled the fine-tuning of the model and accelerated the learning process.XGBoost was identified as the best performing algorithm in a study published in 2022 for a similar predictive task.that study aimed to develop and validate a machine learning-based model to predict regional lymph node metastasis in osteosarcoma using data from 1,201 patients, identifying t and M stage, surgery and chemotherapy as significant risk factors and XGBoost as the best performing predictive algorithm for that task (64).
the use of disease datasets is also employed by other frameworks; for example, Tang et al (65) described a Gaussian randomizer-based system for early fundus screening with privacy preserving and domain adaptation, employing a multi-disease dataset.
It would be of interest, as an extension of the proposed framework, to evaluate a set of different base classifiers in the context of the oneVsrest strategy and observe the effect that their substitution may have on the classification performance.the present study focused on a subset of the available algorithms and strategies; therefore, there exist other potential machine-learning components and techniques to be evaluated on this task.In terms of the dataset itself, graph exploration highlighted shared label instances that potentially contain information relevant to myocardial infarction pathophysiology.the label imbalance that marks the dataset constitutes an interesting point in terms of handling a multi-label classification problem.Methods to address imbalanced datasets in multi-label classification have been reviewed elsewhere and include, but are not limited to, random oversampling and undersampling, heuristic oversampling, cost-sensitive learning, and ensemble approaches (26,31).Oversampling is the process of increasing the rate of minority class instances within an imbalanced dataset to compensate for the occurrence of common classes (66).Modern and widely-used techniques, such as the Synthetic Minority over-sampling technique (SMotE) create synthetic data points by using the feature space of the minority class and k-nearest neighbors; however, applying the k-nearest neighbor approach on binary input data such as the dataset at hand would serve no purpose (67).Furthermore, the existence of binary input data excludes the use of SMotENC, the SMotE extension for numerical and categorical features (67).therefore, if an added oversampling step were to be applied, a custom oversampling function would need to be created to increase tail label samples based on the calculated oversampling ratio of the labels.lastly, the concept of errors constitutes an important facet of developing accurate and reliable biomedical classification models.A classifier is subject to two main types of errors, false positives, also known as type I errors, and false negatives, also known as type II errors (59).In the case of false positives, the classifier predicts a label which is not present in the test set, while in the case of false negatives, a label that should have been predicted is missing.Similarly, true positives are results where the classifier has correctly predicted a label presence and in true negatives, the classifier has correctly predicted the absence of label, i.e., the absence of the positive instance.In the case of disease complication predictions, we are greatly invested in limiting the rate of false negatives, where the classifier fails to predict a label (a complication) which in truth exists.One could argue that in the hospital setting, it would be less damaging to monitor a patient in anticipation of a complication that turns out to be a false positive, than failing to catch a complication that may be lethal.therefore, the selection of performance metrics and the penalties enforced on the errors of the model are greatly affected by the nature of the disease which we wish to interpret as a classification task.
In conclusion, MI constitutes a highly frequent phenomenon in the subset of the population suffering from cardiovascular problems.the development of accurate and scalable systems to support the decision-making process of the medical professionals in the hospital can alleviate the weight of patient management and may potentially increase the odds of survival for myocardial infarction patients.the use of predictive systems for disease-related challenges has been garnering attention the past years with the increase in computational power and novelty of algorithms.
the data-driven approach presented herein and the obtained results underline the potential of machine learning applications in risk predictions, in particular for the challenge associated with MI.As demonstrated through the evaluation, high-performance algorithms, such as the Extreme Gradient Boosting algorithm can be employed as base classifiers in the context of machine-learning model development, while disciplines such as graph theory can shed light into the elaborate networks underlying myocardial infarction progression.Public dataset repositories can provide the large-scale quantities of biomedical and patient data that are required to build efficient and reliable predictive classification models.This data-driven approach can be further scaled and enhanced; there exists promise in the use of ensemble models, made up of different classifiers with different aptitude towards predicting specific labels.Overall, the classifier-based pipeline holds the potential to support the decision-making process of healthcare professionals and aid a proactive approach to patient care.

Figure 1 .
Figure 1.Overview of the steps of the pipeline used herein.

Figure 2 .
Figure 2. (a) Bar histogram depicting the distribution of label instances across the dataset.(B) Bar histogram representing the distribution of total label numbers assigned to each sample (a patient).Note that the number of labels exhibited by a patient will normally be smaller than the total of possible outcomes, as one patient will not exhibit all possible outcomes.

Figure 4 .
Figure 4. Chart reflecting the candidate classification algorithms and their respective performance as measured through the hamming loss metric.CC, Classifier Chains; BRkNN, Binary Relevance k-Nearest Neighbor; RF, Random Forest; MLkNN, Multi-label k-Nearest Neighbor.